The purpose of this project is to carry out a comprehensive analysis of a provided Hepatitis data set. The work involves cleaning the data, applying appropriate statistical methods, and analysing the data in full. It also includes interpretation and discussion of the data, graphs, tables and results.
The following areas were covered during the project work:
Relevant Information:
Number of Instances: 155
Number of Attributes: 20 (including the class attribute)
The attribute information is:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
# Load data in Python
initial_data = pd.read_csv("hepatitis.txt", sep = ",", header = None)
initial_data
| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | 30 | 2 | 1 | 2 | 2 | 2 | 2 | 1 | 2 | 2 | 2 | 2 | 2 | 1.00 | 85 | 18 | 4.0 | ? | 1.0 |
| 1 | 2 | 50 | 1 | 1 | 2 | 1 | 2 | 2 | 1 | 2 | 2 | 2 | 2 | 2 | 0.90 | 135 | 42 | 3.5 | ? | 1.0 |
| 2 | 2 | 78 | 1 | 2 | 2 | 1 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 0.70 | 96 | 32 | 4.0 | ? | 1.0 |
| 3 | 2 | 31 | 1 | ? | 1 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 0.70 | 46 | 52 | 4.0 | 80 | 1.0 |
| 4 | 2 | 34 | 1 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 1.00 | ? | 200 | 4.0 | ? | 1.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 150 | 1 | 46 | 1 | 2 | 2 | 1 | 1 | 1 | 2 | 2 | 2 | 1 | 1 | 1 | 7.60 | ? | 242 | 3.3 | 50 | 2.0 |
| 151 | 2 | 44 | 1 | 2 | 2 | 1 | 2 | 2 | 2 | 1 | 2 | 2 | 2 | 2 | 0.90 | 126 | 142 | 4.3 | ? | 2.0 |
| 152 | 2 | 61 | 1 | 1 | 2 | 1 | 1 | 2 | 1 | 1 | 2 | 1 | 2 | 2 | 0.80 | 75 | 20 | 4.1 | ? | 2.0 |
| 153 | 2 | NaN | 2 | 1 | 2 | 1 | 2 | 2 | 2 | 2 | 1 | 1 | 2 | 1 | 1.50 | 81 | 19 | 4.1 | 48 | 2.0 |
| 154 | 1 | 43 | 1 | 2 | 2 | 1 | 2 | 2 | 2 | 2 | 1 | 1 | 1 | 2 | 1.20 | 100 | 19 | 3.1 | 42 | 2.0 |
155 rows × 20 columns
initial_data.head()
| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | 30 | 2 | 1 | 2 | 2 | 2 | 2 | 1 | 2 | 2 | 2 | 2 | 2 | 1.00 | 85 | 18 | 4.0 | ? | 1.0 |
| 1 | 2 | 50 | 1 | 1 | 2 | 1 | 2 | 2 | 1 | 2 | 2 | 2 | 2 | 2 | 0.90 | 135 | 42 | 3.5 | ? | 1.0 |
| 2 | 2 | 78 | 1 | 2 | 2 | 1 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 0.70 | 96 | 32 | 4.0 | ? | 1.0 |
| 3 | 2 | 31 | 1 | ? | 1 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 0.70 | 46 | 52 | 4.0 | 80 | 1.0 |
| 4 | 2 | 34 | 1 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 1.00 | ? | 200 | 4.0 | ? | 1.0 |
# Display details of data.
initial_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 155 entries, 0 to 154
Data columns (total 20 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   0       155 non-null    int64
 1   1       154 non-null    object
 2   2       155 non-null    int64
 3   3       153 non-null    object
 4   4       155 non-null    int64
 5   5       154 non-null    object
 6   6       155 non-null    object
 7   7       155 non-null    object
 8   8       155 non-null    object
 9   9       155 non-null    object
 10  10      154 non-null    object
 11  11      155 non-null    object
 12  12      155 non-null    object
 13  13      155 non-null    object
 14  14      153 non-null    object
 15  15      154 non-null    object
 16  16      155 non-null    object
 17  17      154 non-null    object
 18  18      155 non-null    object
 19  19      154 non-null    float64
dtypes: float64(1), int64(3), object(16)
memory usage: 24.3+ KB
display(initial_data.dtypes.value_counts())
object     16
int64       3
float64     1
dtype: int64
Observations:
# Checking for missing data
initial_data.isnull().sum()
0     0
1     1
2     0
3     2
4     0
5     1
6     0
7     0
8     0
9     0
10    1
11    0
12    0
13    0
14    2
15    1
16    0
17    1
18    0
19    1
dtype: int64
initial_data.isnull().sum().sum()
10
print("The total number of missing data is", initial_data.isnull().sum().sum())
The total number of missing data is 10
Analysis so far shows evidence of additional missing data, represented by "?" and ".". The data also has misleading values such as '9999' and '99999'. These need to be identified and treated. To identify the missing data, I will reload the data and pass these placeholders to the na_values parameter of pd.read_csv.
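As a minimal sketch of what na_values does, the snippet below reads a small made-up CSV string twice, once without and once with the parameter. The sample rows and the name `sample_csv` are illustrative assumptions and are not taken from the hepatitis file.

```python
import io
import pandas as pd

# Hypothetical three-row sample; the values are made up for illustration.
sample_csv = "2,30,?\n1,.,80\n2,9999,42\n"

# Without na_values, the placeholders are read as ordinary values.
raw = pd.read_csv(io.StringIO(sample_csv), header=None)

# With na_values, the same placeholders are parsed as NaN.
clean = pd.read_csv(io.StringIO(sample_csv), header=None,
                    na_values=['?', '.', '9999', '99999'])

raw_missing = int(raw.isnull().sum().sum())
clean_missing = int(clean.isnull().sum().sum())
print(raw_missing, clean_missing)
```

The same comparison on the real file is what moves the missing-value count from the first load to the second.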
# Load data in Python
data = pd.read_csv("hepatitis.txt", sep = ",", header = None, na_values=['?','.','9999','99999'])
data
| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | 30.0 | 2 | 1.0 | 2 | 2.0 | 2.0 | 2.0 | 1.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 1.0 | 85.0 | 18.0 | 4.0 | NaN | 1.0 |
| 1 | 2 | 50.0 | 1 | 1.0 | 2 | 1.0 | 2.0 | 2.0 | 1.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 0.9 | 135.0 | 42.0 | 3.5 | NaN | 1.0 |
| 2 | 2 | 78.0 | 1 | 2.0 | 2 | 1.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 0.7 | 96.0 | 32.0 | 4.0 | NaN | 1.0 |
| 3 | 2 | 31.0 | 1 | NaN | 1 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 0.7 | 46.0 | 52.0 | 4.0 | 80.0 | 1.0 |
| 4 | 2 | 34.0 | 1 | 2.0 | 2 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 1.0 | NaN | 200.0 | 4.0 | NaN | 1.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 150 | 1 | 46.0 | 1 | 2.0 | 2 | 1.0 | 1.0 | 1.0 | 2.0 | 2.0 | 2.0 | 1.0 | 1.0 | 1.0 | 7.6 | NaN | 242.0 | 3.3 | 50.0 | 2.0 |
| 151 | 2 | 44.0 | 1 | 2.0 | 2 | 1.0 | 2.0 | 2.0 | 2.0 | 1.0 | 2.0 | 2.0 | 2.0 | 2.0 | 0.9 | 126.0 | 142.0 | 4.3 | NaN | 2.0 |
| 152 | 2 | 61.0 | 1 | 1.0 | 2 | 1.0 | 1.0 | 2.0 | 1.0 | 1.0 | 2.0 | 1.0 | 2.0 | 2.0 | 0.8 | 75.0 | 20.0 | 4.1 | NaN | 2.0 |
| 153 | 2 | NaN | 2 | 1.0 | 2 | 1.0 | 2.0 | 2.0 | 2.0 | 2.0 | 1.0 | 1.0 | 2.0 | 1.0 | 1.5 | 81.0 | 19.0 | 4.1 | 48.0 | 2.0 |
| 154 | 1 | 43.0 | 1 | 2.0 | 2 | 1.0 | 2.0 | 2.0 | 2.0 | 2.0 | 1.0 | 1.0 | 1.0 | 2.0 | 1.2 | 100.0 | 19.0 | 3.1 | 42.0 | 2.0 |
155 rows × 20 columns
data.columns = ['Class', 'Age', 'Sex', 'Steroid', 'Antivirals', 'Fatigue', 'Malaise', 'Anorexia', 'Liver_Big', 'Liver_Firm', 'Spleen_Palpable', 'Spiders', 'Ascites', 'Varices', 'Bilirubin', 'Alk_Phosphate', 'Sgot', 'Albumin', 'Protime', 'Histology']
data
| | Class | Age | Sex | Steroid | Antivirals | Fatigue | Malaise | Anorexia | Liver_Big | Liver_Firm | Spleen_Palpable | Spiders | Ascites | Varices | Bilirubin | Alk_Phosphate | Sgot | Albumin | Protime | Histology |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | 30.0 | 2 | 1.0 | 2 | 2.0 | 2.0 | 2.0 | 1.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 1.0 | 85.0 | 18.0 | 4.0 | NaN | 1.0 |
| 1 | 2 | 50.0 | 1 | 1.0 | 2 | 1.0 | 2.0 | 2.0 | 1.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 0.9 | 135.0 | 42.0 | 3.5 | NaN | 1.0 |
| 2 | 2 | 78.0 | 1 | 2.0 | 2 | 1.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 0.7 | 96.0 | 32.0 | 4.0 | NaN | 1.0 |
| 3 | 2 | 31.0 | 1 | NaN | 1 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 0.7 | 46.0 | 52.0 | 4.0 | 80.0 | 1.0 |
| 4 | 2 | 34.0 | 1 | 2.0 | 2 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 1.0 | NaN | 200.0 | 4.0 | NaN | 1.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 150 | 1 | 46.0 | 1 | 2.0 | 2 | 1.0 | 1.0 | 1.0 | 2.0 | 2.0 | 2.0 | 1.0 | 1.0 | 1.0 | 7.6 | NaN | 242.0 | 3.3 | 50.0 | 2.0 |
| 151 | 2 | 44.0 | 1 | 2.0 | 2 | 1.0 | 2.0 | 2.0 | 2.0 | 1.0 | 2.0 | 2.0 | 2.0 | 2.0 | 0.9 | 126.0 | 142.0 | 4.3 | NaN | 2.0 |
| 152 | 2 | 61.0 | 1 | 1.0 | 2 | 1.0 | 1.0 | 2.0 | 1.0 | 1.0 | 2.0 | 1.0 | 2.0 | 2.0 | 0.8 | 75.0 | 20.0 | 4.1 | NaN | 2.0 |
| 153 | 2 | NaN | 2 | 1.0 | 2 | 1.0 | 2.0 | 2.0 | 2.0 | 2.0 | 1.0 | 1.0 | 2.0 | 1.0 | 1.5 | 81.0 | 19.0 | 4.1 | 48.0 | 2.0 |
| 154 | 1 | 43.0 | 1 | 2.0 | 2 | 1.0 | 2.0 | 2.0 | 2.0 | 2.0 | 1.0 | 1.0 | 1.0 | 2.0 | 1.2 | 100.0 | 19.0 | 3.1 | 42.0 | 2.0 |
155 rows × 20 columns
# Displays the head of data.
data.head()
| | Class | Age | Sex | Steroid | Antivirals | Fatigue | Malaise | Anorexia | Liver_Big | Liver_Firm | Spleen_Palpable | Spiders | Ascites | Varices | Bilirubin | Alk_Phosphate | Sgot | Albumin | Protime | Histology |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | 30.0 | 2 | 1.0 | 2 | 2.0 | 2.0 | 2.0 | 1.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 1.0 | 85.0 | 18.0 | 4.0 | NaN | 1.0 |
| 1 | 2 | 50.0 | 1 | 1.0 | 2 | 1.0 | 2.0 | 2.0 | 1.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 0.9 | 135.0 | 42.0 | 3.5 | NaN | 1.0 |
| 2 | 2 | 78.0 | 1 | 2.0 | 2 | 1.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 0.7 | 96.0 | 32.0 | 4.0 | NaN | 1.0 |
| 3 | 2 | 31.0 | 1 | NaN | 1 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 0.7 | 46.0 | 52.0 | 4.0 | 80.0 | 1.0 |
| 4 | 2 | 34.0 | 1 | 2.0 | 2 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 1.0 | NaN | 200.0 | 4.0 | NaN | 1.0 |
# Displays the tail of data.
data.tail()
| | Class | Age | Sex | Steroid | Antivirals | Fatigue | Malaise | Anorexia | Liver_Big | Liver_Firm | Spleen_Palpable | Spiders | Ascites | Varices | Bilirubin | Alk_Phosphate | Sgot | Albumin | Protime | Histology |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 150 | 1 | 46.0 | 1 | 2.0 | 2 | 1.0 | 1.0 | 1.0 | 2.0 | 2.0 | 2.0 | 1.0 | 1.0 | 1.0 | 7.6 | NaN | 242.0 | 3.3 | 50.0 | 2.0 |
| 151 | 2 | 44.0 | 1 | 2.0 | 2 | 1.0 | 2.0 | 2.0 | 2.0 | 1.0 | 2.0 | 2.0 | 2.0 | 2.0 | 0.9 | 126.0 | 142.0 | 4.3 | NaN | 2.0 |
| 152 | 2 | 61.0 | 1 | 1.0 | 2 | 1.0 | 1.0 | 2.0 | 1.0 | 1.0 | 2.0 | 1.0 | 2.0 | 2.0 | 0.8 | 75.0 | 20.0 | 4.1 | NaN | 2.0 |
| 153 | 2 | NaN | 2 | 1.0 | 2 | 1.0 | 2.0 | 2.0 | 2.0 | 2.0 | 1.0 | 1.0 | 2.0 | 1.0 | 1.5 | 81.0 | 19.0 | 4.1 | 48.0 | 2.0 |
| 154 | 1 | 43.0 | 1 | 2.0 | 2 | 1.0 | 2.0 | 2.0 | 2.0 | 2.0 | 1.0 | 1.0 | 1.0 | 2.0 | 1.2 | 100.0 | 19.0 | 3.1 | 42.0 | 2.0 |
# Code to display the number of rows and columns
print(f"There are {data.shape[0]} rows and {data.shape[1]} columns.")
There are 155 rows and 20 columns.
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 155 entries, 0 to 154
Data columns (total 20 columns):
 #   Column           Non-Null Count  Dtype
---  ------           --------------  -----
 0   Class            155 non-null    int64
 1   Age              153 non-null    float64
 2   Sex              155 non-null    int64
 3   Steroid          151 non-null    float64
 4   Antivirals       155 non-null    int64
 5   Fatigue          153 non-null    float64
 6   Malaise          153 non-null    float64
 7   Anorexia         154 non-null    float64
 8   Liver_Big        144 non-null    float64
 9   Liver_Firm       144 non-null    float64
 10  Spleen_Palpable  149 non-null    float64
 11  Spiders          149 non-null    float64
 12  Ascites          148 non-null    float64
 13  Varices          150 non-null    float64
 14  Bilirubin        147 non-null    float64
 15  Alk_Phosphate    118 non-null    float64
 16  Sgot             151 non-null    float64
 17  Albumin          136 non-null    float64
 18  Protime          87 non-null     float64
 19  Histology        154 non-null    float64
dtypes: float64(17), int64(3)
memory usage: 24.3 KB
display(data.dtypes.value_counts())
float64    17
int64       3
dtype: int64
Observations:
# Checking for missing data
data.isnull().sum()
Class               0
Age                 2
Sex                 0
Steroid             4
Antivirals          0
Fatigue             2
Malaise             2
Anorexia            1
Liver_Big          11
Liver_Firm         11
Spleen_Palpable     6
Spiders             6
Ascites             7
Varices             5
Bilirubin           8
Alk_Phosphate      37
Sgot                4
Albumin            19
Protime            68
Histology           1
dtype: int64
# Percentage calculation of missing values.
pd.DataFrame(data={'% of Missing Values':round(data.isna().sum()/data.isna().count()*100,2)})
| | % of Missing Values |
|---|---|
| Class | 0.00 |
| Age | 1.29 |
| Sex | 0.00 |
| Steroid | 2.58 |
| Antivirals | 0.00 |
| Fatigue | 1.29 |
| Malaise | 1.29 |
| Anorexia | 0.65 |
| Liver_Big | 7.10 |
| Liver_Firm | 7.10 |
| Spleen_Palpable | 3.87 |
| Spiders | 3.87 |
| Ascites | 4.52 |
| Varices | 3.23 |
| Bilirubin | 5.16 |
| Alk_Phosphate | 23.87 |
| Sgot | 2.58 |
| Albumin | 12.26 |
| Protime | 43.87 |
| Histology | 0.65 |
data.isnull().sum().sum()
194
Observation:
print("The total number of missing data is", data.isnull().sum().sum())
The total number of missing data is 194
Passing na_values when importing the data has increased the number of missing values from 10 to 194. The affected columns have also been converted from object to float.
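The change of data type can also be seen on a single column. The sketch below is an illustration only: the series `col` is made up, and pd.to_numeric with errors='coerce' is shown as an alternative route to the same float result, not as the method used in this notebook.

```python
import pandas as pd

# Hypothetical column holding numbers as text with a '?' placeholder.
col = pd.Series(['85', '?', '135'])

# errors='coerce' turns anything that cannot be parsed into NaN.
numeric = pd.to_numeric(col, errors='coerce')

# NaN is a floating-point value, so the whole column becomes float64.
print(numeric.dtype, int(numeric.isna().sum()))
```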
Treating missing values is an important step in cleaning the data and making it ready for further analysis and usage. It is thus important to identify them correctly. Knowing how these values are distributed can guide how to treat them. In this data, the missing values will be imputed using backward fill (bfill), which replaces each missing value with the next valid value below it in the same column.
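To make the behaviour of backward fill concrete, here is a small sketch on a made-up series (the values are assumptions chosen for illustration). It also shows one limitation: a gap in the last position has no later value to copy.

```python
import numpy as np
import pandas as pd

# Hypothetical column with gaps in the middle and at the end.
s = pd.Series([np.nan, 80.0, np.nan, 48.0, np.nan])

# Backward fill copies the next valid value upward into each gap.
filled = s.bfill()

# A gap in the last row has no later value to copy, so it stays NaN.
remaining = int(filled.isna().sum())
print(filled.tolist(), remaining)
```

In the hepatitis data the last row (index 154) has no missing values, so this limitation does not arise here.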
data = data.bfill()  # same result as fillna(method='bfill'), which newer pandas versions deprecate
data
| | Class | Age | Sex | Steroid | Antivirals | Fatigue | Malaise | Anorexia | Liver_Big | Liver_Firm | Spleen_Palpable | Spiders | Ascites | Varices | Bilirubin | Alk_Phosphate | Sgot | Albumin | Protime | Histology |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | 30.0 | 2 | 1.0 | 2 | 2.0 | 2.0 | 2.0 | 1.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 1.0 | 85.0 | 18.0 | 4.0 | 80.0 | 1.0 |
| 1 | 2 | 50.0 | 1 | 1.0 | 2 | 1.0 | 2.0 | 2.0 | 1.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 0.9 | 135.0 | 42.0 | 3.5 | 80.0 | 1.0 |
| 2 | 2 | 78.0 | 1 | 2.0 | 2 | 1.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 0.7 | 96.0 | 32.0 | 4.0 | 80.0 | 1.0 |
| 3 | 2 | 31.0 | 1 | 2.0 | 1 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 0.7 | 46.0 | 52.0 | 4.0 | 80.0 | 1.0 |
| 4 | 2 | 34.0 | 1 | 2.0 | 2 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 1.0 | 95.0 | 200.0 | 4.0 | 75.0 | 1.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 150 | 1 | 46.0 | 1 | 2.0 | 2 | 1.0 | 1.0 | 1.0 | 2.0 | 2.0 | 2.0 | 1.0 | 1.0 | 1.0 | 7.6 | 126.0 | 242.0 | 3.3 | 50.0 | 2.0 |
| 151 | 2 | 44.0 | 1 | 2.0 | 2 | 1.0 | 2.0 | 2.0 | 2.0 | 1.0 | 2.0 | 2.0 | 2.0 | 2.0 | 0.9 | 126.0 | 142.0 | 4.3 | 48.0 | 2.0 |
| 152 | 2 | 61.0 | 1 | 1.0 | 2 | 1.0 | 1.0 | 2.0 | 1.0 | 1.0 | 2.0 | 1.0 | 2.0 | 2.0 | 0.8 | 75.0 | 20.0 | 4.1 | 48.0 | 2.0 |
| 153 | 2 | 43.0 | 2 | 1.0 | 2 | 1.0 | 2.0 | 2.0 | 2.0 | 2.0 | 1.0 | 1.0 | 2.0 | 1.0 | 1.5 | 81.0 | 19.0 | 4.1 | 48.0 | 2.0 |
| 154 | 1 | 43.0 | 1 | 2.0 | 2 | 1.0 | 2.0 | 2.0 | 2.0 | 2.0 | 1.0 | 1.0 | 1.0 | 2.0 | 1.2 | 100.0 | 19.0 | 3.1 | 42.0 | 2.0 |
155 rows × 20 columns
# Checking for missing data
data.isnull().sum()
Class              0
Age                0
Sex                0
Steroid            0
Antivirals         0
Fatigue            0
Malaise            0
Anorexia           0
Liver_Big          0
Liver_Firm         0
Spleen_Palpable    0
Spiders            0
Ascites            0
Varices            0
Bilirubin          0
Alk_Phosphate      0
Sgot               0
Albumin            0
Protime            0
Histology          0
dtype: int64
The output above shows that all missing values have been filled, without deleting any rows.
data.describe().T
| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| Class | 155.0 | 1.793548 | 0.406070 | 1.0 | 2.00 | 2.0 | 2.0 | 2.0 |
| Age | 155.0 | 41.322581 | 12.743185 | 7.0 | 32.00 | 39.0 | 50.0 | 78.0 |
| Sex | 155.0 | 1.103226 | 0.305240 | 1.0 | 1.00 | 1.0 | 1.0 | 2.0 |
| Steroid | 155.0 | 1.516129 | 0.501360 | 1.0 | 1.00 | 2.0 | 2.0 | 2.0 |
| Antivirals | 155.0 | 1.845161 | 0.362923 | 1.0 | 2.00 | 2.0 | 2.0 | 2.0 |
| Fatigue | 155.0 | 1.354839 | 0.480015 | 1.0 | 1.00 | 1.0 | 2.0 | 2.0 |
| Malaise | 155.0 | 1.600000 | 0.491486 | 1.0 | 1.00 | 2.0 | 2.0 | 2.0 |
| Anorexia | 155.0 | 1.793548 | 0.406070 | 1.0 | 2.00 | 2.0 | 2.0 | 2.0 |
| Liver_Big | 155.0 | 1.838710 | 0.368991 | 1.0 | 2.00 | 2.0 | 2.0 | 2.0 |
| Liver_Firm | 155.0 | 1.587097 | 0.493952 | 1.0 | 1.00 | 2.0 | 2.0 | 2.0 |
| Spleen_Palpable | 155.0 | 1.806452 | 0.396360 | 1.0 | 2.00 | 2.0 | 2.0 | 2.0 |
| Spiders | 155.0 | 1.664516 | 0.473690 | 1.0 | 1.00 | 2.0 | 2.0 | 2.0 |
| Ascites | 155.0 | 1.870968 | 0.336322 | 1.0 | 2.00 | 2.0 | 2.0 | 2.0 |
| Varices | 155.0 | 1.883871 | 0.321418 | 1.0 | 2.00 | 2.0 | 2.0 | 2.0 |
| Bilirubin | 155.0 | 1.452258 | 1.231214 | 0.3 | 0.70 | 1.0 | 1.5 | 8.0 |
| Alk_Phosphate | 155.0 | 104.232258 | 48.609739 | 30.0 | 76.00 | 85.0 | 126.5 | 295.0 |
| Sgot | 155.0 | 84.722581 | 88.783154 | 14.0 | 31.50 | 55.0 | 99.0 | 648.0 |
| Albumin | 155.0 | 3.803871 | 0.674574 | 2.1 | 3.35 | 4.0 | 4.2 | 6.4 |
| Protime | 155.0 | 61.141935 | 21.700571 | 0.0 | 46.00 | 62.0 | 75.0 | 100.0 |
| Histology | 155.0 | 1.451613 | 0.499266 | 1.0 | 1.00 | 1.0 | 2.0 | 2.0 |
Observations:
This section will be looking at the univariate, bivariate and multivariate analysis of the data. This would involve the plotting of data and interpreting the plots. Outliers would also be checked and remediated if necessary.
# Exploring the target variable
data.Class.unique()
array([2, 1])
# Pie chart to represent target data
data.groupby('Class').size().plot(kind='pie', autopct='%1.1f%%')
plt.title('Diagram of Class')
fig = plt.gcf()
fig.set_size_inches(7,7)
plt.show()
The project data did not indicate whether "LIVE" and "DIE" are coded as 1 or 2. For this analysis, 1 is taken to represent "DIE" and 2 to represent "LIVE". Under this coding, about 79% of the patients lived and 21% died.
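The class shares can also be computed directly instead of being read off the pie chart. The sketch below uses a made-up target series, and the mapping of 1 to "DIE" and 2 to "LIVE" follows the assumption stated above.

```python
import pandas as pd

# Hypothetical target column; 1 = "DIE" and 2 = "LIVE" is an assumed coding.
target = pd.Series([2, 2, 1, 2, 2, 1, 2, 2, 2, 2])

# Map the numeric codes to labels, then compute the share of each class.
labels = target.map({1: 'DIE', 2: 'LIVE'})
shares = labels.value_counts(normalize=True) * 100
print(shares.round(1))
```

Applied to the real column, the same pattern would be `data['Class'].map({1: 'DIE', 2: 'LIVE'}).value_counts(normalize=True)`.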
#Histogram of all the columns in the data.
all_col_plotted =['Age', 'Sex', 'Steroid', 'Antivirals', 'Fatigue', 'Malaise']
def plothist(in_data, req_features):
    # One subplot per feature, stacked vertically
    fig, axs = plt.subplots(len(req_features), 1, figsize=(15, 25))
    for ax, item in zip(axs, req_features):
        in_data[item].hist(ax=ax, density=True)
        # Overlay a kernel density estimate on the normalised histogram
        in_data[item].plot.density(ax=ax, title=item)
plothist(data, all_col_plotted)
Observations:
#Histogram of all the columns in the data.
all_col_plotted =['Anorexia', 'Liver_Big', 'Liver_Firm', 'Spleen_Palpable', 'Spiders']
def plothist(in_data, req_features):
    # One subplot per feature, stacked vertically
    fig, axs = plt.subplots(len(req_features), 1, figsize=(15, 25))
    for ax, item in zip(axs, req_features):
        in_data[item].hist(ax=ax, density=True)
        # Overlay a kernel density estimate on the normalised histogram
        in_data[item].plot.density(ax=ax, title=item)
plothist(data, all_col_plotted)
Observations:
#Histogram of all the columns in the data.
all_col_plotted =['Ascites', 'Varices', 'Bilirubin', 'Alk_Phosphate']
def plothist(in_data, req_features):
    # One subplot per feature, stacked vertically
    fig, axs = plt.subplots(len(req_features), 1, figsize=(15, 25))
    for ax, item in zip(axs, req_features):
        in_data[item].hist(ax=ax, density=True)
        # Overlay a kernel density estimate on the normalised histogram
        in_data[item].plot.density(ax=ax, title=item)
plothist(data, all_col_plotted)
Observations:
#Histogram of all the columns in the data.
all_col_plotted =['Sgot', 'Albumin', 'Protime', 'Histology']
def plothist(in_data, req_features):
    # One subplot per feature, stacked vertically
    fig, axs = plt.subplots(len(req_features), 1, figsize=(15, 25))
    for ax, item in zip(axs, req_features):
        in_data[item].hist(ax=ax, density=True)
        # Overlay a kernel density estimate on the normalised histogram
        in_data[item].plot.density(ax=ax, title=item)
plothist(data, all_col_plotted)
Observations:
# Univariate analysis of numerical variables, to study their central tendency and dispersion.
# This function takes a numerical column as input and draws a boxplot and a histogram for it.
def histogram_boxplot(feature, figsize=(10, 10), bins=None):
    # Two stacked subplots sharing the x-axis: boxplot on top, histogram below
    f2, (ax_box2, ax_hist2) = plt.subplots(
        nrows=2,
        sharex=True,
        gridspec_kw={"height_ratios": (.25, .75)},
        figsize=figsize,
    )
    # Boxplot; the extra marker indicates the mean value of the column
    sns.boxplot(x=feature, ax=ax_box2, showmeans=True, color='violet')
    # Histogram; histplot replaces the deprecated distplot
    sns.histplot(x=feature, kde=False, ax=ax_hist2, bins=bins if bins else 'auto')
    ax_hist2.axvline(np.mean(feature), color='green', linestyle='--')  # Add mean to the histogram
    ax_hist2.axvline(np.median(feature), color='black', linestyle='-')  # Add median to the histogram
histogram_boxplot(data['Age'])
Observation:
histogram_boxplot(data['Bilirubin'])
Observations:
histogram_boxplot(data['Alk_Phosphate'])
Observations:
histogram_boxplot(data['Sgot'])
Observations:
histogram_boxplot(data['Albumin'])
Observations:
histogram_boxplot(data['Protime'])
Observations:
An outlier is a data point that differs significantly from other observations in a dataset. The univariate analysis above showed a number of outliers in some of the variables in the data set. The following variables have outliers in them:
# Plot histograms of the continuous variables, which are the ones that can contain outliers.
# The two-valued coded columns are excluded.
coded_cols = ['Class', 'Sex', 'Steroid', 'Antivirals', 'Fatigue', 'Malaise', 'Anorexia',
              'Liver_Big', 'Liver_Firm', 'Spleen_Palpable', 'Spiders', 'Ascites',
              'Varices', 'Histology']
all_col = [c for c in data.select_dtypes(include=np.number).columns if c not in coded_cols]
plt.figure(figsize=(17,75))
for i in range(len(all_col)):
    plt.subplot(18,3,i+1)
    plt.hist(data[all_col[i]])
    plt.tight_layout()
    plt.title(all_col[i],fontsize=25)
plt.show()
# Outlier detection using boxplots
plt.figure(figsize=(20,30))
for i, variable in enumerate(all_col):
    plt.subplot(5,4,i+1)
    plt.boxplot(data[variable],whis=1.5)
    plt.tight_layout()
    plt.title(variable)
plt.show()
This technique uses the interquartile range (IQR) to treat outliers. The rule of thumb is that any value outside the range from Q1 - 1.5 × IQR to Q3 + 1.5 × IQR is an outlier. Such values can be removed or, as done here, capped at the nearest bound.
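As a small worked sketch of this rule of thumb, the snippet below computes the two bounds for a made-up series and caps the one extreme value. The numbers are assumptions chosen so that the arithmetic is easy to follow by hand.

```python
import pandas as pd

# Hypothetical values with one obvious extreme point.
values = pd.Series([1.0, 2.0, 3.0, 4.0, 100.0])

q1 = values.quantile(0.25)   # 2.0
q3 = values.quantile(0.75)   # 4.0
iqr = q3 - q1                # 2.0
lower = q1 - 1.5 * iqr       # -1.0
upper = q3 + 1.5 * iqr       # 7.0

# Flag points outside the bounds, then cap them instead of dropping rows.
is_outlier = (values < lower) | (values > upper)
capped = values.clip(lower, upper)
print(int(is_outlier.sum()), capped.tolist())
```

Capping keeps all rows in the data set, which matters for a sample as small as 155 observations.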
# Treat outliers by flooring and capping
def treat_outliers(data, col):
    '''
    Treats outliers in a variable by clipping to the IQR whiskers.
    data: data frame
    col: str, name of the numerical column
    '''
    Q1 = data[col].quantile(0.25)  # 25th percentile
    Q3 = data[col].quantile(0.75)  # 75th percentile
    IQR = Q3 - Q1
    Lower_Whisker = Q1 - 1.5*IQR
    Upper_Whisker = Q3 + 1.5*IQR
    # Values below Lower_Whisker are set to Lower_Whisker,
    # and values above Upper_Whisker are set to Upper_Whisker
    data[col] = np.clip(data[col], Lower_Whisker, Upper_Whisker)
    return data

def treat_outliers_all(data, col_list):
    '''
    Treats outliers in all the listed numerical variables.
    data: data frame
    col_list: list of numerical column names
    '''
    for c in col_list:
        data = treat_outliers(data, c)
    return data
# Keep only the continuous columns; the two-valued coded columns are excluded
excluded_cols = ['Class', 'Sex', 'Steroid', 'Antivirals', 'Fatigue', 'Malaise', 'Anorexia',
                 'Liver_Big', 'Liver_Firm', 'Spleen_Palpable', 'Spiders', 'Ascites',
                 'Varices', 'Histology']
numerical_col = [c for c in data.select_dtypes(include=np.number).columns if c not in excluded_cols]
data = treat_outliers_all(data,numerical_col)
# Look at the boxplots again to check whether the outliers have been treated
plt.figure(figsize=(20,30))
for i, variable in enumerate(numerical_col):
    plt.subplot(5,4,i+1)
    plt.boxplot(data[variable],whis=1.5)
    plt.tight_layout()
    plt.title(variable)
plt.show()
From the diagram above, all the outliers in the data have been capped at the whisker bounds, so no points fall outside the whiskers any more.
This section plots pairs of variables to compare their relationships.
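Because the two classes differ greatly in size, raw counts can be hard to compare between them. One option, sketched here on a made-up frame `toy` (an assumption for illustration, not the project data), is to pass normalize='index' to pd.crosstab so that each row is expressed as proportions.

```python
import pandas as pd

# Hypothetical coded columns in the style of the data set (values 1 and 2).
toy = pd.DataFrame({
    'Class': [1, 1, 2, 2, 2, 2],
    'Fatigue': [1, 1, 1, 2, 2, 2],
})

# Raw counts of each Fatigue code within each Class.
counts = pd.crosstab(toy['Class'], toy['Fatigue'])

# normalize='index' converts each row to proportions that sum to 1.
shares = pd.crosstab(toy['Class'], toy['Fatigue'], normalize='index')
print(counts)
print(shares)
```

Calling `.plot.bar(stacked=True)` on the normalised table would give bars of equal height, which makes the within-class split easier to read.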
# Bar graph of the target variable 'Class' vs. 'Sex'.
pd.crosstab(data['Class'],data['Sex']).plot.bar(stacked=True)
<AxesSubplot:xlabel='Class'>
Observations:
# Bar graph of the target variable 'Class' vs. 'Steroid'.
pd.crosstab(data['Class'],data['Steroid']).plot.bar(stacked=False)
<AxesSubplot:xlabel='Class'>
Observations:
# Bar graph of the target variable 'Class' vs. 'Antivirals'.
pd.crosstab(data['Antivirals'],data['Class']).plot.bar(stacked=False)
<AxesSubplot:xlabel='Antivirals'>
Observations:
# Bar graph of the target variable 'Class' vs. 'Fatigue'.
pd.crosstab(data['Class'],data['Fatigue']).plot.bar(stacked=False)
<AxesSubplot:xlabel='Class'>
Observations:
# Bar graph of the target variable 'Class' vs. 'Malaise'.
pd.crosstab(data['Class'],data['Malaise']).plot.bar(stacked=False)
<AxesSubplot:xlabel='Class'>
Observations:
# Bar graph of the target variable 'Class' vs. 'Anorexia'.
pd.crosstab(data['Class'],data['Anorexia']).plot.bar(stacked=False)
<AxesSubplot:xlabel='Class'>
Observations:
# Bar graph of the target variable 'Class' vs. 'Liver_Big'.
pd.crosstab(data['Class'],data['Liver_Big']).plot.bar(stacked=False)
<AxesSubplot:xlabel='Class'>
Observations:
dfcorr = data.copy()
# Correlation visualisation
sns.set(rc={'figure.figsize':(16,10)})
sns.heatmap(dfcorr.corr(),
annot=True,
linewidths=.5,
center=0,
cbar=False,
cmap="YlGnBu")
plt.show()
Correlation is a statistical measure that expresses the extent to which two variables are linearly related (meaning they change together at a constant rate). Regarding the correlation diagram above, the following were my observations:
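To show what a single cell of the heatmap measures, the sketch below computes Pearson's correlation coefficient from its definition on made-up data and compares it with the value from DataFrame.corr(), which uses Pearson's method by default. The frame `toy` and its values are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical paired measurements, made up for illustration.
toy = pd.DataFrame({
    'x': [1.0, 2.0, 3.0, 4.0, 5.0],
    'y': [2.0, 4.0, 5.0, 4.0, 5.0],
})

# Pearson's r from its definition: co-variation over the product of spreads.
dx = toy['x'] - toy['x'].mean()
dy = toy['y'] - toy['y'].mean()
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# DataFrame.corr() uses Pearson's r by default, so the two should agree.
r_pandas = toy.corr().loc['x', 'y']
print(round(r_manual, 4), round(r_pandas, 4))
```

Values near 0 indicate little linear relationship, while values near 1 or -1 indicate a strong one.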
# Showing pairplot for all the variables.
sns.pairplot(data)
plt.show()
Observation: The diagram above confirms that the correlations between the variables are very low. For some pairs of variables there is almost no relationship at all.
sns.pairplot(data,hue='Class')
plt.show()
This project has been an extensive exercise of data analysis and visualization of the Hepatitis data provided. The following key observations were made during the data analysis: